pacman::p_load(jsonlite, tidygraph, ggraph,
               visNetwork, graphlayouts, ggforce,
               skimr, tidytext, tidyverse, igraph, wordcloud, cluster)
Han Shumin
June 2, 2023
June 15, 2023
With reference to Mini-Challenge 3 of VAST Challenge 2023, and using appropriate static and interactive statistical graphics methods, I will address the following:
Network Analysis: Conduct network analysis on the knowledge graph to extract its structural properties and characteristics. This includes centrality measures (e.g., degree, betweenness, closeness) and community detection, which provide insights into the connectivity and organization of the graph.
Anomaly Detection: Apply measures such as degree centrality, which counts how many edges each node has; the most degree-central actor is the one with the most ties. For example, a Business Owner who owns only one business should be less suspicious than a Business Owner who owns more than three companies.
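As a minimal sketch of the degree-based idea above, the chunk below builds a toy company-to-person graph with igraph and flags persons tied to more than three companies. The edge list, the names, and the threshold of three are illustrative assumptions, not values taken from the MC3 data.

```r
library(igraph)

# Hypothetical toy edge list: company (source) linked to person (target).
toy_edges <- data.frame(
  source = c("Co A", "Co B", "Co C", "Co D", "Co E"),
  target = c("Owner 1", "Owner 1", "Owner 1", "Owner 1", "Owner 2")
)

toy_graph <- graph_from_data_frame(toy_edges, directed = FALSE)

# Degree centrality: the number of edges attached to each node.
deg <- degree(toy_graph)

# Keep only the person nodes and flag those with more than three ties.
owner_deg <- deg[unique(toy_edges$target)]
suspicious <- names(owner_deg[owner_deg > 3])
suspicious
```

Here only the person tied to four companies is flagged; the person with a single tie is not.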
The following code chunks will install and load the required packages.
In the code chunk below, fromJSON() of the jsonlite package is used to import MC3.json into the R environment.
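The import step can be sketched as follows. Because the file itself is not reproduced here, the example parses a small inline JSON string instead; the key name "links" and the file path shown in the comment are assumptions for illustration only.

```r
library(jsonlite)

# For the real file the call would look like this (path is an assumption):
# MC3 <- fromJSON("data/MC3.json")

# Hypothetical miniature with the same general shape: a list of tables.
toy_json <- '{
  "nodes": [{"id": "Jones LLC", "type": "Company"},
            {"id": "Christina Taylor", "type": "Company Contacts"}],
  "links": [{"source": "Jones LLC", "target": "Christina Taylor",
             "type": "Company Contacts"}]
}'

toy <- fromJSON(toy_json)

# fromJSON() simplifies arrays of objects into data frames by default.
str(toy, max.level = 1)
```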
The code chunks below extract the nodes and edges data tables from the MC3 list object and save the output as tibble data frames called MC3_nodes and MC3_edges.
MC3_nodes <- as_tibble(MC3$nodes) %>%
  # distinct() %>%
  mutate(country = as.character(country),
         id = as.character(id),
         product_services = as.character(product_services),
         revenue_omu = as.numeric(as.character(revenue_omu)),
         type = as.character(type)) %>%
  select(id, country, type, revenue_omu, product_services)
glimpse(MC3_nodes)
Rows: 27,622
Columns: 5
$ id <chr> "Jones LLC", "Coleman, Hall and Lopez", "Aqua Advance…
$ country <chr> "ZH", "ZH", "Oceanus", "Utoporiana", "ZH", "ZH", "Rio…
$ type <chr> "Company", "Company", "Company", "Company", "Company"…
$ revenue_omu <dbl> 310612303, 162734684, 115004667, 90986413, 81466667, …
$ product_services <chr> "Automobiles", "Passenger cars, trucks, vans, and bus…
mutate() and as.character() are used to convert the field data type from list to character.
To convert revenue_omu from list data type to numeric data type, we need to convert the values into character first by using as.character(). Then, as.numeric() will be used to convert them into numeric data type.
select() is used to re-organise the order of the fields.
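The two-step conversion described above can be illustrated in base R. The list below is a hypothetical stand-in for a list-typed revenue_omu column; the NULL entry is an assumption used to show how an empty value ends up as NA.

```r
# Hypothetical list column with a number, a numeric string, and an empty entry.
revenue_list <- list(310612303, "162734684", NULL)

# Step 1: flatten the list into a character vector (NULL becomes "NULL").
revenue_chr <- as.character(revenue_list)

# Step 2: parse the strings as numbers; unparseable entries become NA.
# suppressWarnings() hides the "NAs introduced by coercion" warning.
revenue_num <- suppressWarnings(as.numeric(revenue_chr))
revenue_num
```

This is one plausible reason a numeric field can show many missing values after conversion.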
Rows: 24,036
Columns: 4
$ source <chr> "1 AS Marine sanctuary", "1 AS Marine sanctuary", "1 Ltd. Liab…
$ target <chr> "Christina Taylor", "Debbie Sanders", "Angela Smith", "Catheri…
$ type <chr> "Company Contacts", "Beneficial Owner", "Beneficial Owner", "C…
$ weights <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
distinct() is used to ensure that there will be no duplicated records.
mutate() and as.character() are used to convert the field data type from list to character.
group_by() and summarise() are used to count the number of unique links.
filter(source != target) removes any record whose source and target are identical.
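The edge-cleaning steps listed above can be sketched on toy data as follows. This is an assumed reconstruction of the pipeline from its description, not the original chunk; the toy rows and the use of across() are illustrative choices.

```r
library(dplyr)

# Hypothetical edges: one exact duplicate and one self-loop.
toy_edges <- tibble(
  source = c("Co A", "Co A", "Co B", "Co C"),
  target = c("Ann Lee", "Ann Lee", "Bob Tan", "Co C"),
  type   = c("Beneficial Owner", "Beneficial Owner",
             "Company Contacts", "Beneficial Owner")
)

clean_edges <- toy_edges %>%
  distinct() %>%                                   # drop duplicated records
  mutate(across(c(source, target, type), as.character)) %>%
  group_by(source, target, type) %>%
  summarise(weights = n(), .groups = "drop") %>%   # count links per pair
  filter(source != target)                         # drop self-loops

clean_edges
```

Because distinct() runs before the count, every remaining link has a weight of 1, which agrees with the constant weights column in the summary statistics.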
In the code chunk below, skim() of the skimr package is used to display the summary statistics of the MC3_edges tibble data frame.
| Name | MC3_edges |
| Number of rows | 24036 |
| Number of columns | 4 |
| _______________________ | |
| Column type frequency: | |
| character | 3 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| source | 0 | 1 | 6 | 700 | 0 | 12856 | 0 |
| target | 0 | 1 | 6 | 28 | 0 | 21265 | 0 |
| type | 0 | 1 | 16 | 16 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| weights | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 1 | 1 | ▁▁▇▁▁ |
The report above reveals that there are no missing values in any field.
In the code chunk below, datatable() of DT package is used to display MC3_edges tibble dataframe as an interactive table on the html document.

The bar plot above shows the distribution of the type variable in the edge data frame. The two bars show that type has only two categories, “Beneficial Owner” and “Company Contacts”. Beneficial Owner forms the majority, with more than double the count of Company Contacts.
In the data table of the edge data frame, we can see that the source variable contains company names, while the target variable contains persons’ names. We can therefore conclude that the type variable identifies whether a person in the target variable is a Beneficial Owner or a Company Contact.
MC3_graph <- tbl_graph(nodes = MC3_nodes1,
                       edges = MC3_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = as.factor(centrality_closeness()))
MC3_graph %>%
  filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = 0.5)) +
  geom_node_point(aes(size = betweenness_centrality,
                      color = closeness_centrality,
                      alpha = 0.5),
                  show.legend = FALSE) +
  scale_size_continuous(range = c(1, 10)) +
  labs(title = "Initial Network Visualisation") + # Add the plot title
  theme_graph()
In the network graph above, node size encodes the betweenness_centrality score and node colour encodes the closeness_centrality score. Even so, we can only see that nodes near the centre tend to be interlinked with nearby nodes, while nodes near the boundary have few links or only one. Beyond that, the graph yields few insights.
MC3_graph <- tbl_graph(nodes = MC3_nodes1,
                       edges = MC3_edges,
                       directed = FALSE) %>%
  mutate(betweenness_centrality = centrality_betweenness(),
         closeness_centrality = as.factor(centrality_closeness())) %>%
  filter(betweenness_centrality >= 100000)
# Calculate the degree of each node that remains after filtering
degrees <- degree(MC3_graph)
set.seed(1234)
MC3_graph %>%
  # filter(betweenness_centrality >= 100000) %>%
  ggraph(layout = "fr") +
  geom_edge_link(aes(alpha = 0.5)) +
  geom_node_point(aes(size = betweenness_centrality,
                      color = closeness_centrality,
                      alpha = 0.5),
                  show.legend = FALSE) +
  # Label only the nodes with more than 3 links
  geom_node_text(aes(label = ifelse(degrees > 3, as.character(id), "")), size = 2) +
  scale_size_continuous(range = c(1, 10)) +
  labs(title = "Network Visualization with Betweenness centrality \nabove 100000 and degree above 3") + # Add the plot title
  theme_graph()
In the network graph above, I filtered out nodes with a betweenness_centrality score below 100000 to reduce the number of nodes displayed. Nodes with more than 3 links are labelled with their names.
Most of the labelled nodes are companies. However, among the Beneficial Owners displayed, some share the same last name, e.g. John Smith, Jennifer Smith, Amy Williams, and John Williams. People with the same last name could potentially come from the same family, which makes them somewhat more suspicious than others.
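The shared-surname observation above was made by eye; a minimal base R sketch of how it could be checked programmatically is shown below. Treating the last whitespace-separated word as the surname is a simplifying assumption, and the name vector is a hypothetical sample.

```r
# Hypothetical sample of person names from the labelled nodes.
owners <- c("John Smith", "Jennifer Smith", "Amy Williams",
            "John Williams", "Debbie Sanders")

# Assumption: the surname is whatever follows the last whitespace.
surname <- sub("^.*\\s", "", owners)

# Count how often each surname occurs and keep those seen more than once.
surname_counts <- table(surname)
shared <- names(surname_counts[surname_counts > 1])

# Persons whose surname is shared with at least one other person.
owners[surname %in% shared]
```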
In the code chunk below, skim() of skimr package is used to display the summary statistics of MC3_nodes tibble data frame.
| Name | MC3_nodes |
| Number of rows | 27622 |
| Number of columns | 5 |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 1 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| id | 0 | 1 | 6 | 64 | 0 | 22929 | 0 |
| country | 0 | 1 | 2 | 15 | 0 | 100 | 0 |
| type | 0 | 1 | 7 | 16 | 0 | 3 | 0 |
| product_services | 0 | 1 | 4 | 1737 | 0 | 3244 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| revenue_omu | 21515 | 0.22 | 1822155 | 18184433 | 3652.23 | 7676.36 | 16210.68 | 48327.66 | 310612303 | ▇▁▁▁▁ |
The report above reveals that revenue_omu has 21,515 missing values (a complete rate of 0.22), while the other fields have none.
In the code chunk below, datatable() of the DT package is used to display the MC3_nodes tibble data frame as an interactive table on the html document.

The bar plot above shows the distribution of the type variable in the node data frame. The three bars show that type has three categories: “Beneficial Owner”, “Company”, and “Company Contacts”. The type variable identifies which category each id belongs to.
In this section, you will learn how to perform basic text sensing using appropriate functions of tidytext package.
The code chunk below calculates number of times the word fish appeared in the field product_services.
# A tibble: 27,622 × 6
id country type revenue_omu product_services n_fish
<chr> <chr> <chr> <dbl> <chr> <int>
1 Jones LLC ZH Comp… 310612303. Automobiles 0
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,… 0
3 Aqua Advancements Sashimi … Oceanus Comp… 115004667. Holding firm wh… 0
4 Makumba Ltd. Liability Co Utopor… Comp… 90986413. Car service, ca… 0
5 Taylor, Taylor and Farrell ZH Comp… 81466667. Fully electric … 0
6 Harmon, Edwards and Bates ZH Comp… 75070435. Discount superm… 0
7 Punjab s Marine conservati… Riodel… Comp… 72167572. Beef, pork, chi… 0
8 Assam Limited Liability … Utopor… Comp… 72162317. Power and Gas s… 0
9 Ianira Starfish Sagl Import Rio Is… Comp… 68832979. Light commercia… 0
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr… 0
# ℹ 27,612 more rows
# A tibble: 27,622 × 6
id country type revenue_omu product_services n_seafood
<chr> <chr> <chr> <dbl> <chr> <int>
1 Jones LLC ZH Comp… 310612303. Automobiles 0
2 Coleman, Hall and Lopez ZH Comp… 162734684. Passenger cars,… 0
3 Aqua Advancements Sashi… Oceanus Comp… 115004667. Holding firm wh… 0
4 Makumba Ltd. Liability … Utopor… Comp… 90986413. Car service, ca… 0
5 Taylor, Taylor and Farr… ZH Comp… 81466667. Fully electric … 0
6 Harmon, Edwards and Bat… ZH Comp… 75070435. Discount superm… 0
7 Punjab s Marine conserv… Riodel… Comp… 72167572. Beef, pork, chi… 0
8 Assam Limited Liabili… Utopor… Comp… 72162317. Power and Gas s… 0
9 Ianira Starfish Sagl Im… Rio Is… Comp… 68832979. Light commercia… 0
10 Moran, Lewis and Jimenez ZH Comp… 65592906. Automobiles, tr… 0
# ℹ 27,612 more rows
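The counting step that produces columns such as n_fish and n_seafood above can be sketched with str_count() of stringr. This is an assumed reconstruction on hypothetical rows; note that the match is case-sensitive and matches substrings, so "shellfish" counts as a hit for "fish" while "Fish" does not.

```r
library(dplyr)
library(stringr)

# Hypothetical node records.
toy_nodes <- tibble(
  id = c("Co A", "Co B", "Co C"),
  product_services = c("Fish and fish products",
                       "Automobiles",
                       "Frozen seafood, shellfish")
)

toy_counts <- toy_nodes %>%
  mutate(n_fish    = str_count(product_services, "fish"),
         n_seafood = str_count(product_services, "seafood"))

toy_counts
```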
The word tokenisation has different meanings in different scientific domains. In text sensing, tokenisation is the process of breaking up a given text into units called tokens. Tokens can be individual words, phrases or even whole sentences. In the process of tokenisation, some characters like punctuation marks may be discarded. The tokens usually become the input for processes like parsing and text mining.
In the code chunk below, unnest_tokens() of tidytext is used to split the text in the product_services field into words.
The two basic arguments to unnest_tokens() used here are column names. First we have the output column name that will be created as the text is unnested into it (word, in this case), and then the input column that the text comes from (product_services, in this case).
By default, punctuation is stripped.
By default, unnest_tokens() converts the tokens to lowercase, which makes them easier to compare or combine with other datasets. (Use the to_lower = FALSE argument to turn off this behavior).
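A minimal sketch of the tokenisation step on a single hypothetical record is shown below. The output column is named word and the input column is product_services, as described above; the sample text is an assumption.

```r
library(dplyr)
library(tidytext)

# Hypothetical node record with mixed case and punctuation.
toy_nodes <- tibble(
  id = "Co A",
  product_services = "Frozen Fish, Seafood and related Products."
)

# One row per token; punctuation is dropped and tokens are lowercased.
token_nodes <- toy_nodes %>%
  unnest_tokens(word, product_services)

token_nodes
```

The other columns (here id) are repeated for every token, which is what later allows words to be traced back to their company.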
Now we can visualise the words extracted by using the code chunk below.

The bar chart reveals that the unique words include some that may not be useful for analysis, for instance “a” and “to”. In the world of text mining, these are called stop words. You want to remove them from your analysis as they are fillers used to compose a sentence.
The tidytext package has a dataset called stop_words that will help us clean up stop words.
Load the stop_words data included with tidytext. This data is simply a list of words that you may want to remove in a natural language analysis.
Then anti_join() of dplyr package is used to remove all stop words from the analysis.
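The stop-word removal described above can be sketched as follows, assuming the tokens live in a column called word. The token vector is a hypothetical sample.

```r
library(dplyr)
library(tidytext)

# Hypothetical tokens, including the two fillers mentioned above.
tokens <- tibble(word = c("a", "to", "frozen", "fish"))

# anti_join() keeps only the rows of tokens with no match in stop_words.
tokens_clean <- tokens %>%
  anti_join(stop_words, by = "word")

tokens_clean
```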
We can visualise the words extracted again.

df_wordcloud <- stopwords_removed
# Count the frequency of each word
word_frequency <- df_wordcloud %>%
  group_by(word) %>%
  filter(!word %in% c("character", "0", "unknown")) %>%
  summarise(freq = n()) %>%
  arrange(desc(freq))
# Create a word cloud
set.seed(1234) # for reproducibility of random colors
wordcloud(words = word_frequency$word, freq = word_frequency$freq,
          min.freq = 10, max.words = 200,
          random.order = FALSE, rot.per = 0.35,
          colors = brewer.pal(8, "Dark2"))
# Extract the edge table (columns from/to hold node row indices)
edges_df <- MC3_graph %>%
  activate(edges) %>%
  as_tibble()
# Extract the node table and give each node a numeric id matching its row
nodes_df <- MC3_graph %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  select(id, label)
# Convert the graph to undirected
# MC3_graph_undirected <- as.undirected(MC3_graph)
# Perform community detection using the Louvain algorithm on the undirected graph
communities <- cluster_louvain(MC3_graph)
# Get the cluster membership of each node
membership <- membership(communities)
# Add the cluster membership information to the nodes data frame
nodes_df$group <- membership
# Plot the graph with clustered nodes using visNetwork
visNetwork(nodes_df, edges_df) %>%
  visIgraphLayout(layout = "layout_with_fr") %>%
  visEdges(arrows = "to",
           smooth = list(enabled = TRUE, type = "curvedCW"),
           color = list(highlight = "lightgray")) %>%
  visOptions(highlightNearest = list(enabled = TRUE, degree = 1, hover = TRUE, labelOnly = TRUE),
             nodesIdSelection = TRUE,
             selectedBy = "group") %>%
  visLayout(randomSeed = 1234)
Rows: 1,841
Columns: 4
$ label <chr> "8 SE Marine life", "8 SE Marine life", "8 SE Marine life", "Ada…
$ id <int> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 6, 6, 6, 6…
$ group <dbl> 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5, 5, 5, 5, 5, 5, 5…
$ word <chr> "gloves", "products", "niche", "frozen", "octopus", "cephalopod"…
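The community detection call used in this section, cluster_louvain() of igraph, can be illustrated on a toy graph. The graph below (two triangles joined by a single bridge edge) is a hypothetical example chosen so that the grouping is easy to verify by eye.

```r
library(igraph)

# Hypothetical undirected graph: triangle A-B-C, triangle D-E-F, bridge C-D.
toy_graph <- graph_from_literal(A-B, B-C, A-C, D-E, E-F, D-F, C-D)

set.seed(1234)
toy_communities <- cluster_louvain(toy_graph)

# Named vector giving the community id of each node.
toy_membership <- membership(toy_communities)
toy_membership
```

Each triangle should end up in its own community; this membership vector is what gets attached to the node table as the group column for visNetwork.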

df <- stopwords_removed
# Filter the data frame
df_extracted <- df %>%
filter(str_detect(word, pattern = "products"))
# Remove duplicate IDs
df_extracted_distinct <- df_extracted %>%
distinct(id, .keep_all = TRUE)
# Network visualisation for products related companies
id3 <- MC3_edges %>%
select(source) %>%
rename(id = source)
id4 <- MC3_edges %>%
select(id = target)
MC3_nodes_extracted <- rbind(id3, id4) %>%
distinct() %>%
left_join(df_extracted_distinct, unmatched = "drop")
MC3_graph_extracted <- tbl_graph(nodes = MC3_nodes_extracted,
edges = MC3_edges,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness()
) %>%
filter(betweenness_centrality >= quantile(betweenness_centrality, 0.99))
#create Visnetwork graph
edges_df1 <- MC3_graph_extracted %>%
  activate(edges) %>%
  as_tibble()
nodes_df1 <- MC3_graph_extracted %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  select(id, label)
# # Convert the graph to undirected
# MC3_graph_undirected <- as.undirected(MC3_graph)
# Perform community detection using the Louvain algorithm on the undirected graph
communities <- cluster_louvain(MC3_graph_extracted)
# Get the cluster membership of each node
membership <- membership(communities)
# Add the cluster membership information to the nodes data frame
nodes_df1$group <- membership
# Plot the graph with clustered nodes using visNetwork
visNetwork(nodes_df1, edges_df1) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(arrows = "to",
smooth = list(enabled = TRUE, type = "curvedCW"),
color = list(highlight = "lightgray")) %>%
visOptions(highlightNearest = list(enabled = TRUE, degree = 1, hover = TRUE, labelOnly = TRUE),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visLayout(randomSeed = 1234)
df <- stopwords_removed
# Filter the data frame
df_extracted <- df %>%
filter(str_detect(word, pattern = "fish") | str_detect(word, pattern = "seafood"))
# Remove duplicate IDs
df_extracted_distinct <- df_extracted %>%
distinct(id, .keep_all = TRUE)
# Network visualisation for fish and seafood related companies
id3 <- MC3_edges %>%
select(source) %>%
rename(id = source)
id4 <- MC3_edges %>%
select(id = target)
MC3_nodes_extracted <- rbind(id3, id4) %>%
distinct() %>%
left_join(df_extracted_distinct, unmatched = "drop")
MC3_graph_extracted <- tbl_graph(nodes = MC3_nodes_extracted,
edges = MC3_edges,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness()
) %>%
filter(betweenness_centrality >= quantile(betweenness_centrality, 0.99))
#create Visnetwork graph
edges_df1 <- MC3_graph_extracted %>%
  activate(edges) %>%
  as_tibble()
nodes_df1 <- MC3_graph_extracted %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  select(id, label)
# # Convert the graph to undirected
# MC3_graph_undirected <- as.undirected(MC3_graph)
# Perform community detection using the Louvain algorithm on the undirected graph
communities <- cluster_louvain(MC3_graph_extracted)
# Get the cluster membership of each node
membership <- membership(communities)
# Add the cluster membership information to the nodes data frame
nodes_df1$group <- membership
# Plot the graph with clustered nodes using visNetwork
visNetwork(nodes_df1, edges_df1) %>%
visIgraphLayout(layout = "layout_with_graphopt") %>%
visEdges(arrows = "to",
smooth = list(enabled = TRUE, type = "curvedCW"),
color = list(highlight = "lightgray")) %>%
visOptions(highlightNearest = list(enabled = TRUE, degree = 1, hover = TRUE, labelOnly = TRUE),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visLayout(randomSeed = 1234)
df <- stopwords_removed
# Filter the data frame
df_extracted <- df %>%
filter(str_detect(word, pattern = "frozen") )
# Remove duplicate IDs
df_extracted_distinct <- df_extracted %>%
distinct(id, .keep_all = TRUE)
# Network visualisation for frozen related companies
id3 <- MC3_edges %>%
select(source) %>%
rename(id = source)
id4 <- MC3_edges %>%
select(id = target)
MC3_nodes_extracted <- rbind(id3, id4) %>%
distinct() %>%
left_join(df_extracted_distinct, unmatched = "drop")
MC3_graph_extracted <- tbl_graph(nodes = MC3_nodes_extracted,
edges = MC3_edges,
directed = FALSE) %>%
mutate(betweenness_centrality = centrality_betweenness()) %>%
filter(betweenness_centrality >= quantile(betweenness_centrality, 0.99))
#create Visnetwork graph
edges_df1 <- MC3_graph_extracted %>%
  activate(edges) %>%
  as_tibble()
nodes_df1 <- MC3_graph_extracted %>%
  activate(nodes) %>%
  as_tibble() %>%
  rename(label = id) %>%
  mutate(id = row_number()) %>%
  select(id, label)
# # Convert the graph to undirected
# MC3_graph_undirected <- as.undirected(MC3_graph)
# Perform community detection using the Louvain algorithm on the undirected graph
communities <- cluster_louvain(MC3_graph_extracted)
# Get the cluster membership of each node
membership <- membership(communities)
# Add the cluster membership information to the nodes data frame
nodes_df1$group <- membership
# Plot the graph with clustered nodes using visNetwork
visNetwork(nodes_df1, edges_df1) %>%
visIgraphLayout(layout = "layout_with_fr") %>%
visEdges(arrows = "to",
smooth = list(enabled = TRUE, type = "curvedCW"),
color = list(highlight = "lightgray")) %>%
visOptions(highlightNearest = list(enabled = TRUE, degree = 1, hover = TRUE, labelOnly = TRUE),
nodesIdSelection = TRUE,
selectedBy = "group") %>%
visLayout(randomSeed = 1234)